Bayes MCMC Models

Steve Elston

02/28/2021

Review

Bayesian analysis is a contrast to frequentist methods

Review

Two functions must be defined to compute the posterior distribution

Review

There are several solution methods; we apply two possibilities

Review

What are the properties of the Beta distribution?

Beta distribution for different parameter values

Review

Consider the product of a Binomial likelihood and a Beta prior

\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution(z,n)} \\ Beta(\theta | z, n, a, b) &= \frac{Binomial(z,n | \theta)\ Beta(\theta | a, b)}{p(z,n)} \\ &= Beta(\theta\ |\ z + a,\ n-z+b) \end{align}\]

Review

Consider an example with:
- Prior pseudo counts \([1, 9]\): successes \(a = 1\) and failures \(b = 9\), giving the prior \(Beta(\theta | 1, 9)\)
- Evidence: successes \(z = 10\) and failures \(= 30\), so \(n = 40\)
- The posterior is \(Beta(10 + 1,\ 40 - 10 + 9) = Beta(11,\ 39)\)

Prior, likelihood and posterior for distracted driving

Review

What are the 95% credible intervals for \(Beta(11,\ 39)\)?
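One way to compute them, using the scipy.stats Beta distribution (an equal-tailed interval from the 2.5% and 97.5% quantiles):

```python
from scipy.stats import beta

# Posterior from the distracted-driver example: Beta(11, 39)
posterior = beta(11, 39)

# Equal-tailed 95% credible interval
lower, upper = posterior.interval(0.95)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

The posterior mean, \(11/50 = 0.22\), lies inside this interval.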

Posterior density of the probability of distracted drivers

Review

Simulate the probabilities of distracted drivers for the next 10 cars using the posterior, \(Beta(11,\ 39)\)
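A sketch of this posterior predictive simulation with NumPy (the choice of 10,000 draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw probabilities from the posterior Beta(11, 39), then simulate
# the number of distracted drivers among the next 10 cars for each draw
theta = rng.beta(11, 39, size=10000)
distracted = rng.binomial(n=10, p=theta)

print("mean number of distracted drivers:", distracted.mean())
```

The mean count is close to \(10 \times 0.22 = 2.2\), the posterior mean probability times the number of cars.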

Probability of distracted drivers for the next 10 cars

Introduction

How can we extend Bayes models to more complex problems?

Problem with Grid Sampling

Real-world Bayes models have large numbers of parameters, into the millions

Introduction to Markov Chain Monte Carlo

Large-scale Bayesian models need highly efficient sampling methods

What is a Markov process?

MCMC sampling uses a chain of Markov sampling processes

What is a Markov process?

Since the Markov transition process depends only on the current state, a Markov process is memoryless

\[P(X_{t + 1} | X_t = x_t, X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots, X_0 = x_0) = P(X_{t + 1}| X_t = x_t)\]

What is a Markov process?

For system with \(N\) possible states we can write the transition probability matrix, \(\Pi\):

\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1, N}\\ \pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\ \cdots & \cdots & \cdots & \cdots \\ \pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N} \end{bmatrix}\\ where\\ \pi_{i,j} = probability\ of\ transition\ from\ state\ i\ to\ state\ j\\ and\\ \pi_{i,i} = probability\ of\ staying\ in\ state\ i\\ further\\ \pi_{i,j} \ne \pi_{j,i}\ in\ general \]

Example of a Markov Process

To make the foregoing more concrete let’s construct a simple example. We will start with a system of 3 states, \(\{ x_1, x_2, x_3 \}\). The transition matrix is:

\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \pi_{1,3}\\ \pi_{2,1} & \pi_{2,2} & \pi_{2,3}\\ \pi_{3,1} & \pi_{3,2} & \pi_{3,3} \end{bmatrix} = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \]

Example of a Markov Process

Let’s apply the transition matrix to a vector of possible states

\[\vec{x}_{t+1} = \Pi\ \vec{x}_t = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix} \]

From Markov process to Markov chain

The foregoing is a single step of a Markov process

From Markov process to Markov chain

A Markov chain comprises a number of state transitions

From Markov process to Markov chain

Starting from an initial state, \(\vec{x}_0\), repeated transitions of the Markov chain converge to the stationary distribution:

\[\Pi\ \Pi\ \Pi\ \ldots \Pi\ \vec{x}_0 = \Pi^n\ \vec{x}_0 \xrightarrow[\text{$n \rightarrow \infty$}]{} \vec{p}(x)\]
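We can check this convergence numerically for the 3-state transition matrix of the earlier example:

```python
import numpy as np

# Transition matrix from the 3-state example (each column sums to 1)
Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])

# Start in state 1 and apply the transition matrix repeatedly
x = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    x = Pi @ x

# After many transitions, x no longer changes: Pi @ x = x
print("stationary distribution:", x.round(4))
```

The limiting vector is the stationary distribution of the chain; it is a fixed point of \(\Pi\) and is independent of the starting state.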

MCMC and the Metropolis-Hastings Algorithm

The first MCMC sampling algorithm developed was the Metropolis algorithm, later generalized by Hastings; the result is often referred to as the Metropolis-Hastings (M-H) algorithm.

\[Acceptance\ probability = min \left( 1,\ \frac{p(data | new\ parameters)}{p(data | previous\ parameters)} \right)\]

MCMC and the Metropolis-Hastings Algorithm

Eventually, the M-H algorithm converges to the posterior distribution

MCMC and the Metropolis-Hastings Algorithm

Poor convergence arises from low sample efficiency

M-H algorithm example

Let’s try a simple example: estimate the posterior density of a bivariate Normal distribution with

\[\mu = \begin{bmatrix} 0.5\\ 0.5 \end{bmatrix}\]

\[Covariance = \begin{bmatrix} 1 & 0.6\\ 0.6 & 1 \end{bmatrix}\]

Random samples from a bivariate Normal distribution

M-H algorithm example

And, the marginal distributions of the variables

Marginal distributions of bivariate Normal samples

M-H algorithm example

Now, we are ready to sample these data using the M-H MCMC algorithm

The algorithm is:

  1. Compute the bivariate Normal distribution likelihood
  2. Initialize the chain
  3. Initialize hyperparameters and summary statistics
  4. Sample the likelihood of the data using the M-H algorithm
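The steps above can be sketched with NumPy and SciPy; this is a minimal, untuned random-walk Metropolis sampler for the bivariate Normal target defined earlier (the proposal scale of 0.5 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Target: the bivariate Normal from the example
mu = np.array([0.5, 0.5])
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
target = multivariate_normal(mu, cov)

rng = np.random.default_rng(123)
n_samples = 5000
samples = np.empty((n_samples, 2))
current = np.zeros(2)                  # arbitrary starting point for the chain
current_logp = target.logpdf(current)
accepted = 0

for i in range(n_samples):
    # Propose a jump from a symmetric Normal proposal distribution
    proposal = current + rng.normal(scale=0.5, size=2)
    proposal_logp = target.logpdf(proposal)
    # Accept with probability min(1, p(proposal) / p(current))
    if np.log(rng.uniform()) < proposal_logp - current_logp:
        current, current_logp = proposal, proposal_logp
        accepted += 1
    samples[i] = current               # a rejected proposal repeats the current state

print("acceptance rate:", accepted / n_samples)
print("sample mean:", samples.mean(axis=0).round(2))
```

With enough samples, the sample mean approaches \([0.5, 0.5]\); in practice the proposal scale would be tuned to balance acceptance rate against step size.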

MCMC samples from bivariate Normal distribution

M-H algorithm example

Plot the first 500 samples

First 500 MCMC samples from bivariate Normal distribution

Notice the long ‘tail’ on the sampled distribution from the burn-in period.

M-H algorithm example

Marginal distributions of the MCMC samples, less the first 500

Marginal distributions of MCMC samples, less the first 500

These marginals are similar to those of the original distribution

Convergence and sampling efficiency of MCMC

How can we understand the convergence properties of the M-H MCMC sampler?

For our example these statistics are fairly good:

\[Acceptance\ rate = 0.81\] \[Rejection\ rate = 0.19\]

Evaluation of MCMC sampling

A trace plot displays the sampled value of a parameter against the sample number

Trace plots for two variables from the M-H algorithm

The traces show sampling around the highest density values of the parameters, indicating good convergence

Evaluation of MCMC sampling

An autocorrelation plot shows how a sample value is related to the previous samples

Autocorrelation of Markov chain for the two parameters

Evaluation of MCMC sampling

We can relate sampling efficiency to the autocorrelation of the samples

\[ESS = \frac{N}{1 + 2 \sum_k ACF(k)}\]
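A sketch of this effective sample size (ESS) calculation on a synthetic autocorrelated chain, truncating the ACF sum at the first non-positive lag (one common convention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate a correlated AR(1) chain as a stand-in for MCMC output
N, rho = 5000, 0.8
chain = np.empty(N)
chain[0] = 0.0
for t in range(1, N):
    chain[t] = rho * chain[t - 1] + rng.normal()

def ess(x):
    x = x - x.mean()
    n = len(x)
    # Sample autocovariance at each lag, normalized to the autocorrelation
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    acf = acov / acov[0]
    # Sum the ACF until it first drops below zero (simple truncation rule)
    k = 1
    while k < n and acf[k] > 0:
        k += 1
    return n / (1 + 2 * acf[1:k].sum())

print("ESS:", round(ess(chain)))
```

For an AR(1) chain the theoretical value is \(N (1 - \rho) / (1 + \rho)\), about 556 here, so the ESS is far below the nominal 5,000 samples.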

Other MCMC Sampling Algorithms

A number of other powerful MCMC sampling algorithms have been developed

Gibbs sampling

The Gibbs sampler is an improved MCMC sampler that speeds convergence

Gibbs sampling

The basic Gibbs sampler algorithm has the following steps:

  1. For an N dimensional parameter space, \(\{ \theta_1, \theta_2, \ldots, \theta_N \}\), find a random starting point

  2. In order, \(\{1, 2, 3, \ldots, N\}\), choose the next dimension to sample, starting with dimension \(1\); the actual order is not important

  3. Sample the marginal distribution of the parameter given the observations, \(D\), and other parameter values: \(p(\theta_1|D, \theta_2, \theta_3, \ldots, \theta_N)\)

  4. Repeat steps 2 and 3 until convergence
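For a bivariate Normal target like the earlier example, both full conditionals are Normal with known parameters, so the steps above can be sketched directly (assuming mean \([0.5, 0.5]\), unit variances, and correlation 0.6):

```python
import numpy as np

rng = np.random.default_rng(7)
mu1, mu2, rho = 0.5, 0.5, 0.6
cond_sd = np.sqrt(1 - rho**2)    # sd of each conditional distribution

n_samples = 5000
samples = np.empty((n_samples, 2))
theta1, theta2 = 0.0, 0.0        # arbitrary starting point

for i in range(n_samples):
    # Sample each parameter from its conditional given the other:
    # theta1 | theta2 ~ N(mu1 + rho*(theta2 - mu2), 1 - rho^2), and symmetrically
    theta1 = rng.normal(mu1 + rho * (theta2 - mu2), cond_sd)
    theta2 = rng.normal(mu2 + rho * (theta1 - mu1), cond_sd)
    samples[i] = theta1, theta2

print("sample means:", samples.mean(axis=0).round(2))
print("sample correlation:", np.corrcoef(samples.T)[0, 1].round(2))
```

Every draw is accepted, which is one reason Gibbs sampling can converge faster than M-H when the conditionals are tractable.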

Hamiltonian MCMC

The Hamiltonian sampler was proposed in the late 1980s and uses a simple idea from classical mechanics

No U-Turn Sampler

NUTS represents the state of the art in MCMC samplers

Hierarchical modeling

So far, we have only worked with models having a simple flat parameter structure

Chain rule of probability

Use the chain rule of probability to factor a joint distribution into a hierarchy

\[P(A,B) = P(A|B)P(B)\]

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2, A_3, A_4 \ldots, A_n)\]

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2 | A_3, A_4 \ldots, A_n)\\ P(A_3| A_4 \ldots, A_n) \ldots P(A_n)\]

Chain rule of probability

The factorization is not unique.

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_n | A_{n-1}, A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-1}| A_{n-2}, A_{n-3}, \ldots, A_1)\\ P(A_{n-2}| A_{n-3}, \ldots, A_1) \ldots P(A_1)\]

Factorization applied to hierarchical models

Apply the chain rule of probability to Bayes theorem

\[p(\mu, \sigma | D) \propto p(D| \mu, \sigma) p(\mu, \sigma)\]

\[\begin{align} p(\mu, \sigma | D) &\propto p(D | \mu) p(\mu | \sigma) p(\sigma)\\ &\propto\ Likelihood\ *\ Conditional\ Distribution\ of\ \mu\ given\ \sigma\ *\ Prior\ of\ \sigma \end{align}\]

The Gamma distribution

Use the Gamma distribution for the prior of the standard deviation

Density of the Gamma distribution with scale = 2.0

Creating a MCMC hierarchical Bayes Model

We are now ready to create a hierarchical Bayes model

Hierarchical model for the posterior distribution

Creating a MCMC hierarchical Bayes Model

In mathematical terms we can define the hierarchical model as follows:

  1. Prior of the scale, \(\sigma\), is \(Gamma(2, 2)\); fairly noninformative

  2. Prior of the mean, \(\mu\), is \(N(0, 3)\), fairly dispersed and noninformative

  3. The posterior distribution uses the priors for \(\mu\) and \(\sigma\) and a Normal likelihood
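The three pieces above combine into an unnormalized log posterior; here is a sketch using scipy.stats, with hypothetical synthetic data standing in for the observations (note that scipy's Gamma uses a scale parameter, the reciprocal of the rate \(\beta = 2\)):

```python
import numpy as np
from scipy.stats import norm, gamma

def log_posterior(mu, sigma, y):
    """Unnormalized log posterior of the hierarchical model."""
    log_likelihood = norm.logpdf(y, loc=mu, scale=sigma).sum()  # Normal likelihood
    log_prior_mu = norm.logpdf(mu, loc=0, scale=3)              # mu ~ N(0, 3)
    log_prior_sigma = gamma.logpdf(sigma, a=2, scale=1 / 2)     # sigma ~ Gamma(2, 2)
    return log_likelihood + log_prior_mu + log_prior_sigma

# Hypothetical observations drawn from N(1.0, 0.5)
rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=0.5, size=100)

# The log posterior peaks near the parameter values that generated the data
print(log_posterior(1.0, 0.5, y), log_posterior(3.0, 2.0, y))
```

This is the density an MCMC sampler explores; PyMC3 builds the same structure automatically, as shown next.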

Creating a MCMC hierarchical Bayes Model

Using PyMC3 to sample the posterior distribution

import pymc3

def create_model(y):
    ## define a PyMC3 model object
    model = pymc3.Model()

    with model:
        # Define the prior distributions for the two model parameters, the location, mu, and the scale
        scale = pymc3.Gamma('scale', alpha=2, beta=2)
        mu = pymc3.Normal('mu', mu=0, sigma=3.0)

        # The model for the posterior distribution uses the priors of the location and scale
        # to model the observations.
        y_obs = pymc3.Normal('y_obs', mu=mu, sigma=scale, observed=y)
    return model

Evaluating the solution

Posterior distribution and trace plots for two model parameters

Evaluating the solution

PyMC3 provides quite a number of model evaluation statistics for 4 chains × 1,000 samples each

The summary shows a lot of useful information:

Summary of MCMC model performance

Evaluating the solution

Credible intervals from MCMC sampling of model parameters

Summary

For complex Bayesian models we need a computationally efficient approximation

Summary

Hierarchical models can be used to represent complex relationships

Summary

Performance metrics help evaluate the quality of MCMC sampling